Tiffany Chan
Feature Selection, Model Selection, and Tuning
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
conc = pd.read_csv('concrete.csv')
conc.head(5)
conc.shape
#Finding the data types (dtypes) of all variables in the new dataset.
conc.dtypes
Each variable in the dataset is continuous. All are stored as floats and therefore have decimal values except age; age has no fractional values, but it is nonetheless a continuous variable.
# Descriptive statistics of all variables in the dataset.
# Components shown in the following table: name, range of values observed (min-max), mean, median (50th percentile), standard deviation, and quartiles.
conc.describe()
Understanding the meaning of the independent variables by pairing the codebook definitions with the descriptive statistics makes sense. Researching the topic can also be very useful.
For this assignment, our goal is to understand which variables can predict the strength of concrete. Concrete is made from four main materials: cement, water, coarse aggregate (stones, rocks, etc.), and fine aggregate (sand and finer particulate). For this reason, the means of these columns are much higher than the rest.
Age is defined in days.
The other variables, slag (blast furnace slag), ash (fly ash), and superplastic (superplasticizer), can be considered secondary or supplementary materials used to strengthen concrete. Slag improves cement workability, improves long-term compressive and flexural performance, reduces permeability, and reduces concrete's environmental impact. Ash absorbs moisture more easily and can therefore make concrete more stable and stronger. Superplasticizer is a water reducer for cement, but too much superplasticizer can damage concrete's chemical composition, so it makes sense that its amounts are much lower than those of the other materials.
#Histograms of all the continuous variables to see the spread and shape of the data (analysis of the body and tails of the distributions).
columns = list(conc)[:] # Creating a new list with all columns
conc[columns].hist(stacked=False, bins=100, figsize=(12,30), layout=(14,2)); # Histogram of all columns
Although some variables like ash, slag, and superplastic have an excess of zeros, these should not be imputed: they are secondary materials that may not be necessary to make concrete, so the zeros are true values and hold meaning.
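As a quick sanity check on this claim, the share of zeros in each column can be counted directly (a small supporting sketch, not part of the original workflow):
#Counting how many zeros each column contains, and what percentage of the rows that represents.
zero_counts = (conc == 0).sum()
print(zero_counts)
print((zero_counts / len(conc) * 100).round(1))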
Let's look at these plots a little closer.
sns.distplot(conc['cement'])
sns.distplot(conc['slag'])
sns.distplot(conc['ash'])
sns.distplot(conc['water'])
sns.distplot(conc['superplastic'])
sns.distplot(conc['coarseagg'])
sns.distplot(conc['fineagg'])
sns.distplot(conc['age'])
sns.distplot(conc['strength'])
None of the variables are normally distributed; strength comes closest to a bell curve. The other variables have skewed tails on the left or right, which could mean there are outliers in this dataset that we may need to handle.
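To put a number on that skew, pandas can report a per-column skewness coefficient (roughly: 0 means symmetric, positive means a long right tail, negative a long left tail); this is just a supporting check:
#Skewness of each variable; values far from 0 indicate the skewed tails seen in the histograms.
print(conc.skew().sort_values(ascending=False))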
#Looking for missing values in each variable
conc.isnull().sum()
There are no missing values in the dataset.
We will now observe if there are any outliers present.
#Looking for outliers that lie beyond the whiskers of boxplots for every variable.
sns.boxplot(conc['cement'])
sns.boxplot(conc['slag'])
sns.boxplot(conc['ash'])
sns.boxplot(conc['water'])
sns.boxplot(conc['superplastic'])
sns.boxplot(conc['coarseagg'])
sns.boxplot(conc['fineagg'])
sns.boxplot(conc['age'])
sns.boxplot(conc['strength'])
There are multiple outliers in many of the variables. We will handle the outliers later in this analysis.
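Before handling them, a rough count of how many points fall beyond the 1.5*IQR whiskers in each column can be made with a short sketch like this (the same whisker rule the boxplots use):
#Counting values outside the Tukey whiskers (Q1 - 1.5*IQR, Q3 + 1.5*IQR) for every column.
q1 = conc.quantile(0.25)
q3 = conc.quantile(0.75)
iqr = q3 - q1
outlier_mask = (conc < (q1 - 1.5 * iqr)) | (conc > (q3 + 1.5 * iqr))
print(outlier_mask.sum())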
#Duplicates
#Duplicate cases except the first instance.
concduplicaterows = conc[conc.duplicated(keep='first')]
print(concduplicaterows)
#Find the number of sets of duplicates using groupby; the sum() call is only there to materialize the groups and can be disregarded.
print(concduplicaterows.groupby(by=["cement","slag","ash","water","superplastic","coarseagg","fineagg","age","strength"]).sum())
There are duplicates in the dataset. The challenge is deciding whether to remove them. In some models, such as linear regression, duplicates can bias the results. However, although these rows appear two or three times in the dataset, we cannot be certain they are actual duplicates because the dataset contains no ID or case numbers; they could be real observations that coincidentally share identical values. There could very well be two or three concrete samples with the same measurements. In total, there are 11 sets of duplicates in this dataset (25 extra cases). Since 25 cases out of 1030 is only 2.4%, changes to these cases would not have a large impact on the statistical findings for the entire dataset.
Before deleting duplicates, one should always consider whether the rows are true duplicates (for example, data-entry errors) and whether removing them would bias the remaining sample. In this situation, we don't know the answer to either of these concerns, so the duplicates are left in place.
sns.pairplot(conc);
sns.heatmap(conc.corr(), annot = True);
After evaluating Pearson's correlation matrix for the variables, only two relationships are moderately correlated.
#Visualizing bivariate analysis of 2 continuous variables (One is an independent variable, and the other is the dependent target variable)
#Strength is the target variable of our analysis.
#Cement x strength
conc_cem_strength = sns.jointplot(x = 'strength', y = 'cement', data = conc, kind = 'reg', color="#4CB391")
conc_cem_strength.set_axis_labels(xlabel='Strength', ylabel='Cement')
#Slag x Strength
conc_slag_strength = sns.jointplot(x = 'strength', y = 'slag', data = conc, kind = 'reg', color="#CE75A7")
conc_slag_strength.set_axis_labels(xlabel='Strength', ylabel='Slag')
#Ash x Strength
conc_ash_strength = sns.jointplot(x = 'strength', y = 'ash', data = conc, kind = 'reg', color="#727ee4")
conc_ash_strength.set_axis_labels(xlabel='Strength', ylabel='Ash')
#Water x Strength
conc_water_strength = sns.jointplot(x = 'strength', y = 'water', data = conc, kind = 'reg', color="#f2013a")
conc_water_strength.set_axis_labels(xlabel='Strength', ylabel='Water')
#Superplasticizer x Strength
conc_super_strength = sns.jointplot(x = 'strength', y = 'superplastic', data = conc, kind = 'reg', color="#8f1271")
conc_super_strength.set_axis_labels(xlabel='Strength', ylabel='Superplasticizer')
#Coarse Aggregate x Strength
conc_coarseagg_strength = sns.jointplot(x = 'strength', y = 'coarseagg', data = conc, kind = 'reg', color="#642815")
conc_coarseagg_strength.set_axis_labels(xlabel='Strength', ylabel='Coarse Aggregate')
#Fine Aggregate x Strength
conc_fineagg_strength = sns.jointplot(x = 'strength', y = 'fineagg', data = conc, kind = 'reg', color="#22962c")
conc_fineagg_strength.set_axis_labels(xlabel='Strength', ylabel='Fine Aggregate')
#Age x Strength
conc_age_strength = sns.jointplot(x = 'strength', y = 'age', data = conc, kind = 'reg', color="#1d2ba5")
conc_age_strength.set_axis_labels(xlabel='Strength', ylabel='Age')
None of the variables fit perfectly on a straight line. Still, it is worth seeing how linear regression performs on this dataset.
There are multiple opportunities to extract new features from the existing features in this dataset.
Age can be re-categorized into a feature that may generalize better to the test data. After researching concrete, age can be broken down into unsatisfactory, minimum, standard, and extended numbers of aging days.
When mixing aggregate (coarse aggregate or stone) with sand (fine aggregate), there is supposed to be a 2:1 ratio. We can create a new feature that captures the ratio between these two features.
There is also a standard water-to-cement ratio, which should be about 0.45-0.6. We can create another feature that captures the quantitative relationship between these two columns.
In terms of the superplasticizer, its share generally ranges from 0-7% of the entire concrete mixture. Perhaps we can create a new feature expressing superplasticizer as a percentage of the whole mixture in kg/m^3. Since all the predictor variables share the same units of measurement, we can easily calculate this.
#New extracted feature from 'age': 'agedays'
#Finding mode for age to see if the data agrees with the literature online.
#The standard number of days it takes to make concrete according to literature is 28 days.
import statistics
print("Mode for age variable:")
print(statistics.mode(conc['age']))
#The mode shows that most of the cases in this dataset followed the standard number of concrete ageing days.
print("")
#Review the frequency of the other number of days.
print("Frequency for age variable:")
print(conc['age'].value_counts())
#After doing some research on how the number of days can improve compressive strength, I discovered 28 days is the arbitrary standard number of days.
#At 7 days, concrete reaches about 75% of its compressive strength, which is considered the minimum accepted compressive strength.
#At less than 7 days, the strength is not satisfactory.
#More than 28 days allows for assured solidification.
#Re-categorizing 'age' and naming it 'agedays'
bins = [1, 6, 7, 27, 28, 365]
labels = ['1-6', '7', '8-27', '28', '29-365']
conc['agedays'] = pd.cut(conc.age, bins, labels = labels,include_lowest = True)
#Verifying the new categorized variable:
print(conc['agedays'].value_counts())
#Since this variable can be construed as ordinal, we can replace it with ordinal values
replaceStruct = {"agedays": {"1-6": 1, "7": 2 ,"8-27": 3 , "28": 4, "29-365": 5}}
conc=conc.replace(replaceStruct)
conc.head(10)
#Capturing the coarse aggregate to sand ratio. Standard cement:sand:agg ratio is 1:2:4. Dividing coarse aggregate by fine aggregate would normally result in 2.
conc['agg_sand_ratio'] = conc['coarseagg']/conc['fineagg'] # Standard should be 2. Between 1 and 2 suggests more rock than sand but not the standard proportions. 1 means equal sand and rock. Less than 1 suggests more sand than rock (normally for smaller, delicate projects like making a fountain).
conc.head(10)
#Visualizing the new feature using distplot
sns.distplot(conc['agg_sand_ratio']);
The distplot for agg_sand_ratio suggests that most projects did not follow the standard mixing proportions of coarse to fine aggregate. Most cases still had more rock than sand (ratio > 1.0). There are fewer cases where the rock:sand ratio is below 1; those are most likely small projects that require more sand in the mixture.
#Water:Cement ratio
conc['water_cem_ratio'] = conc['water']/conc['cement']
conc.head(10)
#Visualizing water:cement ratio
sns.distplot(conc['water_cem_ratio']);
Normally, this value is around 0.45-0.6, and the distplot seems to reflect that: many cases crowd into the 0.45-0.6 range, suggesting most of the cases in this dataset followed standard water:cement mixing for the most part.
Now we should try to take care of those outliers in the different variables. This is part of EDA and feature engineering.
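The same whisker-capping steps are repeated for several columns below; a small reusable helper along these lines could do the same job (cap_at_whiskers is a hypothetical helper written for this notebook, not a library function):
#Hypothetical helper: cap a column's values at its Tukey whiskers (Q1 - 1.5*IQR and Q3 + 1.5*IQR).
def cap_at_whiskers(df, col):
    q75, q25 = np.percentile(df[col], [75, 25])
    iqr = q75 - q25
    lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr
    df.loc[df[col] < lower, col] = lower
    df.loc[df[col] > upper, col] = upper
    return df
#Example usage (equivalent to the per-column cells that follow):
#for col in ['water', 'slag', 'superplastic', 'fineagg', 'age']:
#    conc = cap_at_whiskers(conc, col)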
#Scale to Range Feature Engineering Technique to Create a New Feature from Water Column.
#This technique doesn't work well with outliers. So, we will have to find a way to deal with the outliers for the water variable.
#Imputing the whisker values in boxplot for the outliers in water.
conc[['water']].boxplot()
q75, q25 = np.percentile(conc['water'], [75 ,25])
iqr = q75 - q25
#Interquartile range:
print("Interquartile range:")
print(iqr)
print("75th percentile")
print(q75)
print("25th percentile")
print(q25)
print("Upper whisker")
print(q75 + 1.5*iqr)
print("Lower whisker")
print(q25 - 1.5*iqr)
#Imputing the outliers above the upper whisker with the upper whisker value
conc.loc[conc['water'] > 232.65, 'water'] = 232.65
conc[['water']].boxplot()
#Imputing the outliers below the lower whisker with the lower whisker value
conc.loc[conc['water'] < 124.25, 'water'] = 124.25
conc[['water']].boxplot()
#Imputing the whisker values in boxplot for the outliers in slag.
conc[['slag']].boxplot()
q75, q25 = np.percentile(conc['slag'], [75 ,25])
iqr = q75 - q25
#Interquartile range:
print("Interquartile range:")
print(iqr)
print("75th percentile")
print(q75)
print("25th percentile")
print(q25)
print("Upper whisker")
print(q75 + 1.5*iqr)
print("Lower whisker")
print(q25 - 1.5*iqr)
#Imputing the outliers above the upper whisker with the upper whisker value
conc.loc[conc['slag'] > 357.375, 'slag'] = 357.375
conc[['slag']].boxplot()
#Imputing the whisker values in boxplot for the outliers in superplastic.
conc[['superplastic']].boxplot()
q75, q25 = np.percentile(conc['superplastic'], [75 ,25])
iqr = q75 - q25
#Interquartile range:
print("Interquartile range:")
print(iqr)
print("75th percentile")
print(q75)
print("25th percentile")
print(q25)
print("Upper whisker")
print(q75 + 1.5*iqr)
print("Lower whisker")
print(q25 - 1.5*iqr)
#Imputing the outliers above the upper whisker with the upper whisker value
conc.loc[conc['superplastic'] > 25.5, 'superplastic'] = 25.5
conc[['superplastic']].boxplot()
#Imputing the whisker values in boxplot for the outliers in fineagg.
conc[['fineagg']].boxplot()
q75, q25 = np.percentile(conc['fineagg'], [75 ,25])
iqr = q75 - q25
#Interquartile range:
print("Interquartile range:")
print(iqr)
print("75th percentile")
print(q75)
print("25th percentile")
print(q25)
print("Upper whisker")
print(q75 + 1.5*iqr)
print("Lower whisker")
print(q25 - 1.5*iqr)
#Imputing the outliers above the upper whisker with the upper whisker value
conc.loc[conc['fineagg'] > 963.575, 'fineagg'] = 963.575
conc[['fineagg']].boxplot()
#Imputing the whisker values in boxplot for the outliers in age.
conc[['age']].boxplot()
q75, q25 = np.percentile(conc['age'], [75 ,25])
iqr = q75 - q25
#Interquartile range:
print("Interquartile range:")
print(iqr)
print("75th percentile")
print(q75)
print("25th percentile")
print(q25)
print("Upper whisker")
print(q75 + 1.5*iqr)
print("Lower whisker")
print(q25 - 1.5*iqr)
#Imputing the outliers above the upper whisker with the upper whisker value
conc.loc[conc['age'] > 129.5, 'age'] = 129.5
conc[['age']].boxplot()
#Making the new feature for superplasticizer percent.
conc['superp_percent'] = conc['superplastic']/(conc['cement'] + conc['slag'] + conc['ash'] + conc['water'] + conc['superplastic'] + conc['coarseagg'] + conc['fineagg'])*100
conc.head(10)
sns.distplot(conc['superp_percent'])
It's important to note that certain regression models are very sensitive to multicollinearity, outliers, and non-standardized/non-normalized columns. These models include linear regression, support vector regression, and K Nearest Neighbors. Tree models like decision tree and random forest are robust enough that these factors don't really influence the results, as are bagging and boosting models whose base estimators are decision trees.
I created pipelines for linear regression, support vector regression, and K Nearest Neighbors that include a scaler (StandardScaler) or normalizer (MinMaxScaler) to standardize/normalize the data.
For a regression model, the analogue of accuracy is the R^2 value. Just as accuracy is not the most informative metric for a classification problem, R^2 alone does not fully describe how well a model fits the data. R^2 measures how much of the variance in the dependent variable is explained by the independent variable(s), but it does not account for overfitting. Mean squared error, on the other hand, is better at revealing underfitting and overfitting. Mean squared error and root mean squared error serve the same purpose; in this assignment, mean squared error will be reported.
Overall, R^2, mean squared error, and mean absolute error all measure a model's statistical robustness.
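As a small illustration of the two metrics (the numbers below are made up purely for demonstration; sklearn's r2_score and mean_squared_error are used):
#Illustration: R^2 = 1 - SS_res/SS_tot, MSE = mean((y - y_hat)^2).
from sklearn.metrics import r2_score, mean_squared_error
y_true = [30.0, 45.0, 60.0]   #made-up strength values (MPa)
y_pred = [32.0, 44.0, 55.0]   #made-up predictions
print("R^2:", r2_score(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))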
#Importing all the necessary packages for our model.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np
import warnings
warnings.filterwarnings('ignore')
#Separate independent features from target feature.
X = conc.drop(['age','strength'], axis=1)
# the dependent variable
Y = conc[['strength']]
#Making the test set 20% of the entire dataframe
x_remaining, x_test, y_remaining, y_test = train_test_split(X, Y, test_size=0.20, random_state=1)
#Making the validation set 20% of the entire dataframe (25% of the remaining data from the last split)
x_train, x_val, y_train, y_val = train_test_split(x_remaining, y_remaining, test_size=0.25, random_state=1)
#The problem with the linear regression model is that the independent variables have to be standardized so that each variable is on the same scale to limit bias towards features with larger value ranges.
#All the original variables (except for age) all have the same units of measurement and would normally not need to be standardized.
#Because we converted age to an ordinal variable and it is no longer continuous, ordinal variables do not need to be standardized.
#HOWEVER, because I created all these new features that are ratios, and percentages, and are no longer in the same scale as the other continuous variables, using a scaler is important so that results from the linear regression model are not biased towards the features with the higher range values.
#Create pipeline that has standardization scaler.
pipeline = Pipeline([
('scaler',StandardScaler()),
('clf', LinearRegression())
])
#Fit the model created from pipeline onto the training data.
pipeline.fit(x_train, y_train)
for idx, col_name in enumerate(x_train.columns):
print("The coefficient for {} is {}".format(col_name, pipeline['clf'].coef_[0][idx]))
From this list of linear regression coefficients, the newly created water_cem_ratio contributes the least to the model and may need to be eliminated in the revised model after tuning. The superplasticizer percentage is the most contributive feature, followed by cement and then agedays (another new feature that captures the ordered concrete-aging periods).
#R^2 value for training data
print("R-squared value for training data")
print(pipeline.score(x_train, y_train)) #Score for linear regression returns the R^2 coefficient (according to sklearn documentation)
print("")
#Cross-validation for R^2 on Validation Set
folds = KFold(n_splits = 50, shuffle = True, random_state = 100)
lrscores = cross_val_score(pipeline, x_train, y_train, scoring='r2', cv=folds)
print("R^2 results for 10 iterations")
print(lrscores)
print("Average of all R^2 iterations and 2 standard deviations")
print("R^2: %.3f (%.3f)" % (rfscores.mean(), rfscores.std()*2))
#Mean Squared Error:
print("Mean Squared Error")
#Mean squared error is to evaluate goodness of fit of the model on the training and validation data.
from sklearn.metrics import mean_squared_error
#from sklearn.model_selection import cross_val_score
print("Calculate mean squared error for training data")
lr_concrete_strength_predict = pipeline.predict(x_train)
print("Mean Squared Error for our Training data")
print(mean_squared_error(y_train, lr_concrete_strength_predict))
print("Mean Squared Error for our Validation data")
print(mean_squared_error(y_val, pipeline.predict(x_val)))
Looking at the results, the R^2 values are close for the training and validation data. The mean squared error was 44.3 on the training data and 51.5 on the validation set, so there is a little overfitting. Looking at the standard deviation from cross-validation, the training R^2 falls within the 95% confidence interval (roughly two standard deviations) obtained from cross-validation.
This suggests the model's goodness of fit is tolerable, but the remaining overfitting could be reduced by dropping some of the less important features from the linear equation.
#Heatmap to Measure Correlation (including new features)
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(conc.corr(), annot = True);
The heatmap shows that the new features water_cem_ratio and agg_sand_ratio are highly correlated (>0.8) with other features in the dataset. For this reason, they should be eliminated from the linear regression model, because linear regression does not handle multicollinearity well. The water-to-cement ratio also contributes little to the linear model according to the coefficient list, which is another reason to remove it.
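A complementary way to confirm the multicollinearity, beyond the heatmap, would be variance inflation factors; a sketch using statsmodels, assuming it is installed (the 'const' row can be ignored):
#Sketch: VIF per predictor; values above roughly 10 usually signal strong multicollinearity.
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
X_vif = sm.add_constant(conc.drop(['strength'], axis=1))
vif = pd.Series([variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])], index=X_vif.columns)
print(vif)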
#Model Tuning for linear regression
#Separate independent features from target feature.
X = conc.drop(['age','strength','water_cem_ratio','agg_sand_ratio'], axis=1) #Removing features that are less contributive and reducing collinearity.
# the dependent variable
Y = conc[['strength']]
#Making the test set 20% of the entire dataframe
x_remaining, x_test, y_remaining, y_test = train_test_split(X, Y, test_size=0.20, random_state=1)
#Making the validation set 20% of the entire dataframe (25% of the remaining data from the last split)
x_train, x_val, y_train, y_val = train_test_split(x_remaining, y_remaining, test_size=0.25, random_state=1)
#Same reasoning as before: the engineered ratio and percentage features are on different scales from the original columns, so the pipeline standardizes the features to avoid biasing the model toward features with larger value ranges.
#Create pipeline that has standardization scaler.
pipeline = Pipeline([
('scaler',StandardScaler()),
('clf', LinearRegression())
])
#Fit the model created from pipeline onto the training data.
pipeline.fit(x_train, y_train)
for idx, col_name in enumerate(x_train.columns):
print("The coefficient for {} is {}".format(col_name, pipeline['clf'].coef_[0][idx]))
Interpretation of the two most contributive coefficients (remembering that the features are standardized, so each coefficient is per one standard deviation of the feature):
A one-standard-deviation increase in superplasticizer percentage corresponds to a 64.7 MPa decrease in predicted strength, and a one-standard-deviation increase in cement corresponds to a 12.1 MPa increase in predicted strength.
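Because the pipeline standardizes the features, the coefficients above can also be expressed per original unit by dividing each one by the scaler's fitted standard deviation; a sketch against the fitted pipeline:
#Converting standardized coefficients back to the original feature units (MPa per unit of each feature).
scaler = pipeline['scaler']
lr = pipeline['clf']
for col, coef, sd in zip(x_train.columns, lr.coef_[0], scaler.scale_):
    print("{}: {:.3f} MPa per original unit".format(col, coef / sd))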
#R^2 value for training data
print("R-squared value for training data")
R2_train= pipeline.score(x_train, y_train) #Score for linear regression returns the R^2 coefficient (according to sklearn documentation)
print(R2_train)
#Cross-validation for R^2 on Validation Set
folds = KFold(n_splits = 50, shuffle = True, random_state = 100)
lrscores = cross_val_score(pipeline, x_train, y_train, scoring='r2', cv=folds)
print("R^2 results for 10 iterations")
print(lrscores)
print("Average of all R^2 iterations and 2 standard deviations")
R2_val_cross_val = "R^2: %.3f (%.3f)" % (lrscores.mean(), lrscores.std()*2)
print(R2_val_cross_val)
#Mean Squared Error:
print("Mean Squared Error")
#Mean squared error is to evaluate goodness of fit of the model on the training and validation data.
from sklearn.metrics import mean_squared_error
#from sklearn.model_selection import cross_val_score
print("Calculate mean squared error for training data")
lr_concrete_strength_predict = pipeline.predict(x_train)
print("Mean Squared Error for our Training data")
MSE_train = mean_squared_error(y_train, lr_concrete_strength_predict)
print(MSE_train)
print("Mean Squared Error for our Validation data")
MSE_val = mean_squared_error(y_val, pipeline.predict(x_val))
print(MSE_val)
print("")
We reduced the overfitting a little, but some overfitting remains. The mean squared error is still fairly high for the training data and for cross-validation, suggesting that linear regression may not be the best-fitting model for this data. We can now test the model on the test data.
print("R-squared for test data")
R2_test = pipeline.score(x_test,y_test)
print(R2_test)
print("MSE of model on test data")
MSE_test = mean_squared_error(y_test, pipeline.predict(x_test))
print(MSE_test)
The R^2 on the test data falls within the 95% confidence interval (0.586-0.978) obtained from cross-validation. So, even though the model is slightly overfit, linear regression with the more important features is still a tolerable (though not the best-fitting) model to carry into production for predicting concrete strength.
Linear_Regression ={'Metrics':['R2_train', 'R2_val_cross_val (+2 sd)', 'R2_test', 'MSE_train', 'MSE_val', 'MSE_test'], 'Linear_Regression':[R2_train, R2_val_cross_val, R2_test, MSE_train, MSE_val, MSE_test]}
Linear_Regression_= pd.DataFrame(Linear_Regression)
Linear_Regression_
#Reset the x dataset back to the form before the linear regression model tuning because tree models like decision tree and random forest are robust.
#Separate independent features from target feature.
X = conc.drop(['age','strength'], axis=1)
# the dependent variable
Y = conc[['strength']]
#Making the test set 20% of the entire dataframe
x_remaining, x_test, y_remaining, y_test = train_test_split(X, Y, test_size=0.20, random_state=1)
#Making the validation set 20% of the entire dataframe (25% of the remaining data from the last split)
x_train, x_val, y_train, y_val = train_test_split(x_remaining, y_remaining, test_size=0.25, random_state=1)
#Decision Tree
#Decision tree doesn't require any standardization because the model is robust to the magnitude of variables.
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree
dTree = DecisionTreeRegressor(random_state=1, max_depth=10) #All other hyperparameters are left at their scikit-learn defaults.
dTree.fit(x_train, y_train)
print("Decision Tree R^2 value for training data")
dt_R2_train= dTree.score(x_train,y_train)#"Score" is the coefficient of determination R^2 of the prediction for DecisionTreeRegression, according to sklearn
print(dt_R2_train)
#Cross-validation for R^2 on Validation Set
folds = KFold(n_splits = 50, shuffle = True, random_state = 100)
dtscores = cross_val_score(dTree, x_train, y_train, scoring='r2', cv=folds)
print("R^2 results for 10 iterations")
print(dtscores)
print("Average of all R^2 iterations and 2 standard deviation")
dt_R2_val_cross_val= "R^2: %.3f (%.3f)" % (dtscores.mean(), dtscores.std()*2)
print(dt_R2_val_cross_val)
#Mean squared error is to evaluate goodness of fit of the model on the training and validation data.
from sklearn.metrics import mean_squared_error
print("Calculate mean squared error for training data")
dt_concrete_strength_predict = dTree.predict(x_train)
print("Mean Squared Error for our Training data")
dt_MSE_train= mean_squared_error(y_train, dt_concrete_strength_predict)
print(dt_MSE_train)
print("Mean Squared Error for Validation Data")
dt_MSE_val= mean_squared_error(y_val, dTree.predict(x_val))
print(dt_MSE_val)
This is clearly an overfit model. The R^2 values may not look dramatically different (0.996 vs. 0.823), but the gap between the training MSE and the validation MSE is stark. Overfitting drives the training MSE toward 0, while the validation MSE stays far from 0, and the confidence interval calculated from cross-validation does not even cover the training MSE.
We must prune the decision tree or tune its hyperparameters. The hyperparameters most influential in reducing overfitting for a decision tree are max_depth and min_impurity_decrease: we want to decrease max_depth and increase min_impurity_decrease. Since max_depth was 10 in the decision tree above, we decrease it to 6 and increase min_impurity_decrease to 1. These settings were chosen by comparing training and validation R^2 scores and the MSE scores from cross-validation.
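As an alternative to hand-picking max_depth and min_impurity_decrease, scikit-learn's cost-complexity pruning path could also be inspected; this is only a sketch of that option, not the approach used below:
#Sketch: cost-complexity pruning; larger ccp_alpha values give smaller, more regularized trees.
path = dTree.cost_complexity_pruning_path(x_train, y_train)
for alpha in path.ccp_alphas[::10]:  #sample every 10th alpha along the path
    pruned = DecisionTreeRegressor(random_state=1, ccp_alpha=alpha).fit(x_train, y_train)
    print("alpha={:.3f}  train R^2={:.3f}  val R^2={:.3f}".format(alpha, pruned.score(x_train, y_train), pruned.score(x_val, y_val)))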
dTree = DecisionTreeRegressor(max_depth = 6, random_state=1, min_impurity_decrease=1, min_samples_leaf=2)
dTree.fit(x_train, y_train)
#Mean squared error is to evaluate goodness of fit of the model on the training and validation data.
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
print("Decision Tree R^2 value for training data")
dt_R2_train= dTree.score(x_train,y_train)#"Score" is the coefficient of determination R^2 of the prediction for DecisionTreeRegression, according to sklearn
print(dt_R2_train)
#Cross-validation for R^2 on Validation Set
folds = KFold(n_splits = 50, shuffle = True, random_state = 100)
dtscores = cross_val_score(dTree, x_train, y_train, scoring='r2', cv=folds)
print("R^2 results for 10 iterations")
print(dtscores)
print("Average of all R^2 iterations and 2 standard deviation")
dt_R2_val_cross_val= "R^2: %.3f (%.3f)" % (dtscores.mean(), dtscores.std()*2)
print(dt_R2_val_cross_val)
#Mean squared error is to evaluate goodness of fit of the model on the training and validation data.
from sklearn.metrics import mean_squared_error
print("Calculate mean squared error for training data")
dt_concrete_strength_predict = dTree.predict(x_train)
print("Mean Squared Error for our Training data")
dt_MSE_train= mean_squared_error(y_train, dt_concrete_strength_predict)
print(dt_MSE_train)
print("Mean Squared Error for Validation Data")
dt_MSE_val= mean_squared_error(y_val, dTree.predict(x_val))
print(dt_MSE_val)
R^2 alone isn't the best way to evaluate overfitting. Looking at the R^2 values for this revised model, the training and cross-validation values are closer, and the mean squared errors for the training and validation sets are not far apart either, so there is only a little overfitting. No matter how much pruning and tuning is done, some overfitting remains. However, the model is now tuned so that the training MSE is not so close to 0 and generalizes to the validation set, and possibly to the test set. The model is decent to carry into production, provided these R^2 values are considered acceptable.
print("R-squared for test data")
dt_R2_test = dTree.score(x_test,y_test)
print(dt_R2_test)
print("MSE of model on test data")
dt_MSE_test = mean_squared_error(y_test, dTree.predict(x_test))
print(dt_MSE_test)
After testing on the test data, the R^2 is within the 95% confidence interval, but the model is still somewhat overfit because the test MSE is slightly higher than both the training and validation MSEs.
Decision_Tree ={'Metrics':['R2_train', 'R2_val_cross_val (+2 sd)', 'R2_test', 'MSE_train', 'MSE_val', 'MSE_test'], 'Decision_Tree':[dt_R2_train, dt_R2_val_cross_val, dt_R2_test, dt_MSE_train, dt_MSE_val, dt_MSE_test]}
Decision_Tree_= pd.DataFrame(Decision_Tree)
Decision_Tree_
#Random Forest
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(random_state = 100, n_estimators=100, max_depth=None, bootstrap=True)
rfr.fit(x_train, y_train)
#R-Square for Training Data
print("R-squared for training data")
rf_R2_train = rfr.score(x_train,y_train)
print(rf_R2_train)
#Cross-validation for R^2.
folds = KFold(n_splits = 50, shuffle = True, random_state = 100)
rfscores = cross_val_score(rfr, x_train, y_train, scoring='r2', cv=folds)
print("R^2 results for 10 iterations")
print(rfscores)
print("Average of all R^2 iterations and 2 standard deviations")
rf_R2_val_cross_val= "R^2: %.3f (%.3f)" % (rfscores.mean(), rfscores.std()*2)
print(rf_R2_val_cross_val)
print("")
#Mean Squared Error:
print("Mean Squared Error")
#Mean squared error is to evaluate goodness of fit of the model on the training and validation data.
from sklearn.metrics import mean_squared_error
#from sklearn.model_selection import cross_val_score
print("Calculate mean squared error for training data")
rf_concrete_strength_predict = rfr.predict(x_train)
print("Mean Squared Error for our Training data")
rf_MSE_train = mean_squared_error(y_train, rf_concrete_strength_predict)
print(rf_MSE_train)
print("Mean Squared Error for our Validation data")
rf_MSE_val= mean_squared_error(y_val, rfr.predict(x_val))
print(rf_MSE_val)
Based on these validation results, the R^2 went down compared to the training R^2, showing that the model does not capture as much of the variance in the validation set.
The overfitting is corroborated by the mean squared errors, which are a better measure of goodness of fit: the MSE was 4.3 on the training data but 28 on the validation data. This stark difference shows how much the model is overfit; a training MSE so close to 0 means the model has essentially memorized the training data.
I will tune the hyperparameters to make the model more generalizable so that it performs decently on the test data. To reduce overfitting in a random forest, it is generally advised to increase n_estimators (the number of trees) and decrease max_depth.
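To pick a reasonable max_depth, a quick comparison of training and validation MSE across a few candidate depths can guide the choice; a sketch of that check (the depth values are just candidates):
#Sketch: compare training vs. validation MSE for a few candidate tree depths.
for depth in [4, 6, 8, 10, None]:
    rf_tmp = RandomForestRegressor(random_state=100, n_estimators=120, max_depth=depth)
    rf_tmp.fit(x_train, y_train)
    print("max_depth={}  MSE train={:.1f}  MSE val={:.1f}".format(depth, mean_squared_error(y_train, rf_tmp.predict(x_train)), mean_squared_error(y_val, rf_tmp.predict(x_val))))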
rfr = RandomForestRegressor(random_state = 100, n_estimators=120, max_depth=6)
rfr.fit(x_train, y_train)
#R-Square for Training Data
print("R-squared for training data")
rf_R2_train = rfr.score(x_train,y_train)
print(rf_R2_train)
#Cross-validation for R^2.
folds = KFold(n_splits = 50, shuffle = True, random_state = 100)
rfscores = cross_val_score(rfr, x_train, y_train, scoring='r2', cv=folds)
print("R^2 results for 10 iterations")
print(rfscores)
print("Average of all R^2 iterations and 2 standard deviations")
rf_R2_val_cross_val= "R^2: %.3f (%.3f)" % (rfscores.mean(), rfscores.std()*2)
print(rf_R2_val_cross_val)
print("")
#Mean Squared Error:
print("Mean Squared Error")
#Mean squared error is to evaluate goodness of fit of the model on the training and validation data.
from sklearn.metrics import mean_squared_error
#from sklearn.model_selection import cross_val_score
print("Calculate mean squared error for training data")
rf_concrete_strength_predict = rfr.predict(x_train)
print("Mean Squared Error for our Training data")
rf_MSE_train = mean_squared_error(y_train, rf_concrete_strength_predict)
print(rf_MSE_train)
print("Mean Squared Error for our Validation data")
rf_MSE_val= mean_squared_error(y_val, rfr.predict(x_val))
print(rf_MSE_val)
Compared to the model before tuning, this one is less overfit: the difference between the mean squared errors is now only about 16. This is the best random forest model obtained while juggling the tradeoffs of each hyperparameter. The increase in the training MSE suggests the model is more generalizable after tuning than the earlier overfit model.
#Testing on the test data
print("R-squared for test data")
rf_R2_test = rfr.score(x_test,y_test)
print(rf_R2_test)
print("MSE of model on test data")
rf_MSE_test = mean_squared_error(y_test, rfr.predict(x_test))
print(rf_MSE_test)
The test-set MSE is much closer to the validation-set MSE, but both remain well above the training MSE, so the model is still somewhat overfit. I tuned the most influential Random Forest parameters to the best of my ability, but some overfitting remains.
Random_Forest ={'Metrics':['R2_train', 'R2_val_cross_val (+2 sd)', 'R2_test', 'MSE_train', 'MSE_val', 'MSE_test'], 'Random_Forest':[rf_R2_train, rf_R2_val_cross_val, rf_R2_test, rf_MSE_train, rf_MSE_val, rf_MSE_test]}
Random_Forest_= pd.DataFrame(Random_Forest)
Random_Forest_
Support vector regression is another possible choice for predicting the strength of concrete. As with linear regression, the features have to be scaled so they are all on similar ranges, and we must again remove the collinear features as we did for linear regression.
#Remove the features of multicollinearity (like what we did for the regression model)
#Separate independent features from target feature.
X = conc.drop(['age','strength','water_cem_ratio','agg_sand_ratio'], axis=1) #Removing features that are less contributive and reducing collinearity.
# the dependent variable
Y = conc[['strength']]
#Making the test set 20% of the entire dataframe
x_remaining, x_test, y_remaining, y_test = train_test_split(X, Y, test_size=0.20, random_state=1)
#Making the validation set 20% of the entire dataframe (25% of the remaining data from the last split)
x_train, x_val, y_train, y_val = train_test_split(x_remaining, y_remaining, test_size=0.25, random_state=1)
#Support Vector Regression (SVR)
from sklearn.svm import SVR
from sklearn.preprocessing import MinMaxScaler
from sklearn import decomposition, datasets
pipeline_svr = Pipeline([
('scaler',MinMaxScaler()),
('sv', SVR())
])
print("SVR R^2 training data")
pipeline_svr.fit(x_train,y_train)
#The SVR step in the pipeline is left at scikit-learn's default hyperparameters (C=1.0, kernel='rbf', gamma='scale').
svr_R2_train = pipeline_svr.score(x_train,y_train)
print(svr_R2_train)
#Cross-validation for R^2
print("SVR R^2 cross validation")
folds = KFold(n_splits = 10, shuffle = True, random_state = 100)
svrscores = cross_val_score(pipeline_svr, x_train, y_train, scoring='r2', cv=folds)
print(svrscores)
#Cross-validation R2 mean, and 2 standard deviations
svr_R2_val_cross_val= "R^2: %.3f (%.3f)" % (svrscores.mean(), svrscores.std()*2)
print(svr_R2_val_cross_val)
#Mean Squared Errors (MSE) for Training Set
print("Calculate mean squared error for training data")
svr_concrete_strength_predict = pipeline_svr.predict(x_train)
print("Mean Squared Error for our Training data")
svr_MSE_train = mean_squared_error(y_train, svr_concrete_strength_predict)
print(svr_MSE_train)
#Mean Squared Errors for Validation Set
print("Mean Squared Error for our Validation data")
svr_MSE_val = mean_squared_error(y_val, pipeline_svr.predict(x_val))
print(svr_MSE_val)
The R^2 value is not as high as for the other models, and the mean squared error is far from 0. There is overfitting, and manual hyperparameter tuning barely changed the training and validation results. Perhaps this is a good model for GridSearchCV.
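A sketch of what that grid search could look like for the SVR pipeline (the candidate values are illustrative; the sv__ prefix targets the SVR step inside the pipeline):
#Sketch: grid search over the usual SVR knobs (C, gamma, epsilon) with 5-fold cross-validation.
from sklearn.model_selection import GridSearchCV
svr_grid = {'sv__C': [1, 10, 100], 'sv__gamma': ['scale', 0.1, 0.5], 'sv__epsilon': [0.1, 1.0]}
gs_svr = GridSearchCV(pipeline_svr, param_grid=svr_grid, scoring='r2', cv=5)
gs_svr.fit(x_train, y_train)
print(gs_svr.best_params_, gs_svr.best_score_)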
#Testing on the test data
print("R-squared for test data")
svr_R2_test = pipeline_svr.score(x_test,y_test)
print(svr_R2_test)
print("MSE of model on test data")
svr_MSE_test = mean_squared_error(y_test, pipeline_svr.predict(x_test))
print(svr_MSE_test)
SVR ={'Metrics':['R2_train', 'R2_val_cross_val (+2 sd)', 'R2_test', 'MSE_train', 'MSE_val', 'MSE_test'], 'SVR':[svr_R2_train, svr_R2_val_cross_val, svr_R2_test, svr_MSE_train, svr_MSE_val, svr_MSE_test]}
SVR_= pd.DataFrame(SVR)
SVR_
#K-Nearest Neighbors
from sklearn.neighbors import KNeighborsRegressor
pipeline_knn = Pipeline([
('scaler',MinMaxScaler()),
('knr', KNeighborsRegressor())
])
pipeline_knn.fit(x_train,y_train)
#The KNeighborsRegressor step in the pipeline is left at the default n_neighbors=5 with Euclidean distance (p=2).
#KNN R2 Training
print("KNN R2 for Training data")
knn_R2_train = pipeline_knn.score(x_train,y_train)
print(knn_R2_train)
#K-Fold cross-validation for R2
folds = KFold(n_splits = 10, shuffle = True, random_state = 100)
knnscores = cross_val_score(pipeline_knn, x_train, y_train, scoring='r2', cv=folds)
print(knnscores)
print("KNN, R2, cross-validation")
knn_R2_val_cross_val = "R^2: %.3f (%.3f)" % (knnscores.mean(), knnscores.std())
print(knn_R2_val_cross_val)
#Mean Squared Errors (MSE) for Training Set
print("Calculate mean squared error for training data")
knn_concrete_strength_predict = pipeline_knn.predict(x_train)
print("Mean Squared Error for our Training data")
knn_MSE_train = mean_squared_error(y_train, knn_concrete_strength_predict)
print(knn_MSE_train)
print("Mean Squared Error for our Validation data")
knn_MSE_val = mean_squared_error(y_val, pipeline_knn.predict(x_val))
print(knn_MSE_val)
The K Nearest Neighbors model performs reasonably well on this data despite noticeable overfitting. The R^2 value is good and could perhaps be improved by running GridSearchCV. As with SVR, tuning the hyperparameters manually did not do much to control the overfitting.
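A sketch of a grid search over the KNN pipeline (the knr__ prefix targets the KNeighborsRegressor step; the candidate values are illustrative):
#Sketch: tune the number of neighbors and the distance weighting of the KNN step.
from sklearn.model_selection import GridSearchCV
knn_grid = {'knr__n_neighbors': [3, 5, 9, 15, 25], 'knr__weights': ['uniform', 'distance']}
gs_knn = GridSearchCV(pipeline_knn, param_grid=knn_grid, scoring='r2', cv=5)
gs_knn.fit(x_train, y_train)
print(gs_knn.best_params_, gs_knn.best_score_)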
#Testing on the test data
print("R-squared for test data")
knn_R2_test = pipeline_knn.score(x_test,y_test)
print(knn_R2_test)
print("MSE of model on test data")
knn_MSE_test = mean_squared_error(y_test, pipeline_knn.predict(x_test))
print(knn_MSE_test)
KNN ={'Metrics':['R2_train', 'R2_val_cross_val (+2 sd)', 'R2_test', 'MSE_train', 'MSE_val', 'MSE_test'], 'K Nearest_Neighbors':[knn_R2_train, knn_R2_val_cross_val, knn_R2_test, knn_MSE_train, knn_MSE_val, knn_MSE_test]}
KNN_= pd.DataFrame(KNN)
KNN_
#Reset the X dataset back to the full feature set; bagging and boosting models with decision tree base estimators are robust to scaling and multicollinearity.
#Separate independent features from target feature.
X = conc.drop(['age','strength'], axis=1)
# the dependent variable
Y = conc[['strength']]
#Making the test set 20% of the entire dataframe
x_remaining, x_test, y_remaining, y_test = train_test_split(X, Y, test_size=0.20, random_state=1)
#Making the validation set 20% of the entire dataframe (25% of the remaining data from the last split)
x_train, x_val, y_train, y_val = train_test_split(x_remaining, y_remaining, test_size=0.25, random_state=1)
#Bagging Model:
from sklearn.ensemble import BaggingRegressor
from sklearn.datasets import make_regression
bagr = BaggingRegressor(n_estimators=100, random_state=1) # Base estimator is DecisionTree
bagr = bagr.fit(x_train, y_train)
print("R2 Training Set")
bag_R2_train = bagr.score(x_train, y_train)
print(bag_R2_train)
#Cross-validation of R2 value on validation set
folds = KFold(n_splits = 50, shuffle = True, random_state = 100)
bagscores = cross_val_score(bagr, x_train, y_train, scoring='r2', cv=folds)
print(bagscores)
print("Cross-Validation R2")
bag_R2_val_cross_val = "R^2: %.3f (%.3f)" % (bagscores.mean(), bagscores.std())
print(bag_R2_val_cross_val)
print("")
#Mean Squared Errors (MSE) for Training Set
print("Calculate mean squared error for training data")
bag_concrete_strength_predict = bagr.predict(x_train)
print("Mean Squared Error for our Training data")
bag_MSE_train= mean_squared_error(y_train, bag_concrete_strength_predict)
print(bag_MSE_train)
print("Mean Squared Error for our Validation data")
bag_MSE_val= mean_squared_error(y_val, bagr.predict(x_val))
print(bag_MSE_val)
print("Bagging Model R^2 Value for Test Data")
bag_R2_test = bagr.score(x_test, y_test)
print(bag_R2_test)
print("Bagging Model MSE Value for Test Data")
bag_MSE_test = mean_squared_error(y_test, bagr.predict(x_test))
print(bag_MSE_test)
After manually attempting to tune the hyperparameters, the model and results did not change much; n_estimators, whether set to 100 or 1000, made little difference.
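Besides n_estimators, the depth of the trees being bagged can itself be constrained; a sketch, assuming an older scikit-learn release where the argument is named base_estimator (newer releases renamed it to estimator):
#Sketch: bag shallower trees to reduce variance without memorizing the training data.
from sklearn.tree import DecisionTreeRegressor
bagr_shallow = BaggingRegressor(base_estimator=DecisionTreeRegressor(max_depth=6), n_estimators=100, random_state=1)
bagr_shallow.fit(x_train, y_train)
print(bagr_shallow.score(x_train, y_train), bagr_shallow.score(x_val, y_val))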
Bagging ={'Metrics':['R2_train', 'R2_val_cross_val (+2 sd)', 'R2_test', 'MSE_train', 'MSE_val', 'MSE_test'], 'Bagging':[bag_R2_train, bag_R2_val_cross_val, bag_R2_test, bag_MSE_train, bag_MSE_val, bag_MSE_test]}
Bagging_= pd.DataFrame(Bagging)
Bagging_
#Gradient Boost Model
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_regression
gbr = GradientBoostingRegressor(n_estimators=100, random_state=1) # Base estimator is DecisionTreeRegressor, max_depth=3
gbr = gbr.fit(x_train, y_train)
print("R2 training")
gb_R2_train = gbr.score(x_train, y_train)
print(gb_R2_train)
#Cross-validation of R2 value on validation set
folds = KFold(n_splits = 50, shuffle = True, random_state = 100)
gradbscores = cross_val_score(gbr, x_train, y_train, scoring='r2', cv=folds)
print(gradbscores)
gb_R2_val_cross_val= "R^2: %.3f (%.3f)" % (gradbscores.mean(), gradbscores.std())
print(gb_R2_val)
print("")
#Mean Squared Errors (MSE) for Training Set
print("Calculate mean squared error for training data")
gradb_concrete_strength_predict = gbr.predict(x_train)
print("Mean Squared Error for our Training data")
gb_MSE_train = mean_squared_error(y_train, gradb_concrete_strength_predict)
print(gb_MSE_train)
#Mean Squared Error for Validation Set
print("Mean Squared Error for our Validation data")
gb_MSE_val= mean_squared_error(y_val, gbr.predict(x_val))
print(gb_MSE_val)
print("Gradient Boost R^2 Value for Test Data")
gb_R2_test = gbr.score(x_test, y_test)
print(gb_R2_test)
print("Gradient Boost MSE Value for Test Data")
gb_MSE_test = mean_squared_error(y_test, gbr.predict(x_test))
print(gb_MSE_test)
Gradient_Boost ={'Metrics':['R2_train', 'R2_val_cross_val (+2 sd)', 'R2_test', 'MSE_train', 'MSE_val', 'MSE_test'], 'Gradient_Boost':[gb_R2_train, gb_R2_val_cross_val, gb_R2_test, gb_MSE_train, gb_MSE_val, gb_MSE_test]}
Gradient_Boost_= pd.DataFrame(Gradient_Boost)
Gradient_Boost_
Metrics_Dataframe = pd.merge(Linear_Regression_,Decision_Tree_,how='outer',on='Metrics')
Metrics_Dataframe = pd.merge(Metrics_Dataframe,Random_Forest_,how='outer',on='Metrics')
Metrics_Dataframe = pd.merge(Metrics_Dataframe,SVR_,how='outer',on='Metrics')
Metrics_Dataframe = pd.merge(Metrics_Dataframe,KNN_,how='outer',on='Metrics')
Metrics_Dataframe = pd.merge(Metrics_Dataframe,Bagging_,how='outer',on='Metrics')
Metrics_Dataframe = pd.merge(Metrics_Dataframe,Gradient_Boost_,how='outer',on='Metrics')
Metrics_Dataframe
Ultimately, all of these models overfit to some degree, but some are not bad. The linear regression model is only minimally overfit, and Decision Tree and Gradient Boost show little overfitting too, while SVR and K Nearest Neighbors are the most overfit. Overall, Gradient Boost has the best results: its overfitting is small, much of the variation around the mean concrete strength is explained by the independent variables (as evidenced by the high R^2 values on all three splits), and its MSE is relatively low compared to the other models, but not so low that overfitting becomes a serious issue.
#Performing GridSearch CV on the Gradient Boost Model Validation Set
from sklearn.model_selection import GridSearchCV
search_grid={'n_estimators':[100,200,300,500],'max_depth':[1,2,4,6,8],'random_state':[1]}
gsgradientb=GridSearchCV(estimator=gbr,param_grid=search_grid,scoring='r2',n_jobs=1,cv=10)
gsgradientb.fit(x_val, y_val)
#Compare the results to the training using this Gridsearch CV technique.
grids_gradientb_R2_train = gsgradientb.score(x_train, y_train)
print(grids_gradientb_R2_train)
#Finding the best parameters
gsgradientb.best_params_
#This is R2 mean for the validation set iterations.
grids_gradientb_R2_val = gsgradientb.best_score_
grids_gradientb_R2_val
#Mean Squared Error for Training Set
print("Mean Squared Error for Training set")
grids_gradientb_MSE_train = mean_squared_error(y_train, gsgradientb.predict(x_train))
print(grids_gradientb_MSE_train)
#Mean Squared Error for Validation Set
print("Mean Squared Error for our Validation data")
grids_gradientb_MSE_val= mean_squared_error(y_val, gsgradientb.predict(x_val))
print(grids_gradientb_MSE_val)
#R^2 for test set
print("R2 for test set.")
grids_gradientb_R2_test= gsgradientb.score(x_test, y_test)
print(grids_gradientb_R2_test)
#Mean Squared Error for Test Set
print("Mean Squared Error for Test data")
grids_gradientb_MSE_test = mean_squared_error(y_test, gsgradientb.predict(x_test))
print(grids_gradientb_MSE_test)
Although GridSearch is a strong hyperparameter-tuning technique, the result generated here for the gradient boost model is worse. GridSearch finds the best parameters within the grid it is given, but it does not guarantee that the model will neither underfit nor overfit. The gradient boost model built earlier overfits very little compared to the other models, whereas this GridSearch gradient boost model appears to underfit, because the validation-set MSE is much lower than the training MSE; ideally the two values should be similar.
The MSE from the test set is quantitatively closer to the mean squared error for the training set.
GridSearch also has the limitation that we must specify the candidate parameter values ourselves; the best parameters might not even be among the ones we supply.
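One way around having to enumerate every candidate value is randomized search, which samples parameter settings from ranges or distributions instead of an exhaustive grid; a sketch for the gradient boost model (the ranges are illustrative):
#Sketch: RandomizedSearchCV samples n_iter parameter combinations rather than trying every one.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
rand_params = {'n_estimators': randint(100, 600), 'max_depth': randint(1, 10), 'learning_rate': [0.01, 0.05, 0.1]}
rs_gb = RandomizedSearchCV(GradientBoostingRegressor(random_state=1), param_distributions=rand_params, n_iter=20, scoring='r2', cv=5, random_state=1)
rs_gb.fit(x_train, y_train)
print(rs_gb.best_params_, rs_gb.best_score_)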
GridS_Gradient_Boost ={'Metrics':['R2_train', 'R2_val', 'R2_test', 'MSE_train', 'MSE_val', 'MSE_test'], 'Grid_Search_Gradient_Boost':[grids_gradientb_R2_train, grids_gradientb_R2_val, grids_gradientb_R2_test, grids_gradientb_MSE_train, grids_gradientb_MSE_val, grids_gradientb_MSE_test]}
GridS_Gradient_Boost_= pd.DataFrame(GridS_Gradient_Boost)
GridS_Gradient_Boost_
I originally thought Random Forest would perform well because it is such a robust model; standardization/normalization is generally unnecessary for it. However, I was surprised how easy it was to end up with an overfit Random Forest model. I decided to see whether Grid Search could improve the model by searching for better parameter values.
#Applying Grid Search CV to the Random Forest Model
param_grid_rf = {
'bootstrap': [True],
'max_depth': [1, 3, 7],
'max_features': [2, 3],
'min_samples_leaf': [3, 4],
'min_samples_split': [8, 10],
'n_estimators': [100, 150, 200]
}
grid_search_rf = GridSearchCV(estimator = rfr, param_grid = param_grid_rf,
cv = 10, n_jobs = -1, verbose = 2)
grid_search_rf.fit(x_val, y_val)
##Compare the results to the training using this Gridsearch CV technique.
grids_rf_R2_train = grid_search_rf.score(x_train, y_train)
print(grids_rf_R2_train)
grid_search_rf.best_params_
grids_rf_R2_val = grid_search_rf.best_score_
grids_rf_R2_val
print("Mean Squared Error for our Training data")
grids_rf_MSE_train = mean_squared_error(y_train, grid_search_rf.predict(x_train))
print(grids_rf_MSE_train)
#Mean Squared Error for Validation Set
print("Mean Squared Error for our Validation data")
grids_rf_MSE_val= mean_squared_error(y_val, grid_search_rf.predict(x_val))
print(grids_rf_MSE_val)
Again, the validation MSE is lower than the training MSE, suggesting possible underfitting. Although GridSearch can find the best parameters among those it is given, it cannot guarantee that the model will neither overfit nor underfit.
#R-square for Test Set
print("R-square for Test Set")
grids_rf_R2_test = grid_search_rf.score(x_test, y_test)
print(grids_rf_R2_test)
#Mean Squared Error for Test Set
print("Mean Squared Error for our Validation data")
grids_rf_MSE_test= mean_squared_error(y_test, grid_search_rf.predict(x_test))
print(grids_rf_MSE_test)
GridS_RF ={'Metrics':['R2_train', 'R2_val', 'R2_test', 'MSE_train', 'MSE_val', 'MSE_test'], 'Grid_Search_Random Forest':[grids_rf_R2_train, grids_rf_R2_val, grids_rf_R2_test, grids_rf_MSE_train, grids_rf_MSE_val, grids_rf_MSE_test]}
GridS_RF_= pd.DataFrame(GridS_RF)
GridS_RF_
GridS_Dataframe = pd.merge(GridS_Gradient_Boost_,GridS_RF_,how='outer',on='Metrics')
print(GridS_Dataframe)
Gradient_Boost_
Although Grid Search can find the so-called best parameters among those specified, the original Gradient Boost model still seems best suited to answer the question. Its overfitting is small, and its R^2 values for the train, validation, and test sets meet the requirement of an 85-95% (R^2) model. The train, validation, and test MSEs are all small, but not so small as to indicate severe overfitting. If the Gradient Boost model were carried into production, the mean R^2 would be expected to fall within the confidence interval (0.766-0.956) 95% of the time. This is a good range for R^2 because it assures that much of the variation in the dependent variable can be explained by the independent variables studied, and minimal overfitting is achieved as well. The low MSE scores suggest the model captures the data well and generalizes to unseen data.
Resources:
Concrete background: https://www.youtube.com/watch?v=eQzfyaqE0hE , https://www.youtube.com/watch?v=Gx2vZ424t70 , http://www.bnproducts.com/blog/what-are-the-proper-concrete-mix-proportions/ , https://www.researchgate.net/post/The-percentage-of-superplasticizer-and-water-cement-ratio-in-a-high-performance-concrete-mix-design , https://precast.org/2013/10/28-day-myth/#:~:text=A%20typical%20concrete%20compressive%20strength,after%20the%20date%20of%20manufacture.
Linear regression and metrics: https://medium.com/@akshaysjbit/linear-models-and-ols-use-of-cross-validation-in-python-f30955a3f2a , https://benalexkeen.com/linear-regression-in-python-using-scikit-learn/ , https://towardsdatascience.com/what-are-the-best-metrics-to-evaluate-your-regression-model-418ca481755b
Decision Tree: https://www.dezyre.com/recipes/create-and-optimize-baseline-decision-tree-model-for-regression , https://datascience.stackexchange.com/questions/31463/evaluation-metrics-for-decision-tree-regressor-and-knn-regressor/31467
Scaling for different regression models: https://builtin.com/data-science/when-and-why-standardize-your-data , https://towardsai.net/p/data-science/how-when-and-why-should-you-normalize-standardize-rescale-your-data-3f083def38ff
K-Fold cross-validation: https://www.kaggle.com/jnikhilsai/cross-validation-with-linear-regression